Intoduction

In the 21st century, people started travelling a lot, either for business, leisure, or adventure, with the number of passengers served by flights increasing exponentially. For the past 2 years, we were restricted from traveling due to the Pandemic and now restrictions have been eased and the demand to travel has gone back up.
The airlines deploy multiple machine learning models to update the flight ticket price numerous times over a day. They utilize hundreds of features available at their disposal, including all the flight details like departure and arrival city, time, date, user traffic on the day of booking, previous year’s demand over a specific period, etc. This led to many travel hacks claiming to reduce ticket prices for passengers.


SMART Questions

Here are the list of SMART Questions I am trying to answer:

  1. How expensive are business class seats compared to its economy counterparts?
  2. Do flights with stops offer a better price than non-stop flights?
  3. Which combination of days and time gives us the best price?
  4. What combination of variables gives us a best model to predict flight prices?
  5. What kind of model gives us a better prediction of flight prices?

Description of the Data

Understanding data and Preprocessing

The data has been downloaded from kaggle.com link to which has been provided in the reference section.

This dataset contains information about flight booking options from the website “Ease my trip” for flight travel between India’s top 6 metro cities. The data was collected for 50 days, from February 11th to March 31st, 2022.

It has 300,261 observations and 9 variables of the following:

Date : Date of departure

Dep_time : Departure time

From : Departing city

Arr_time : Arrival time

To : Arrival city

Airline : The name of the airline carrier for the journey.

Class : Business or economy seat.

Time_Taken : Flight duration

Stop : Number of stops between departure and arrival city.

Price : The cost of the ticket.


As with any data, the data I acquired needed to be cleaned and new features have been added to aid the analysis down the line.

The preprocessing :

  • Bucketing the time into categories (morning, afternoon, evening, night, and midnight)
  • Adding co-ordinates to the cities (Latitude, Longitude)
  • Finding distances between the cities
  • Extracting day, weekends, months from date feature
  • Creating the flight_code from airline code and flight number
  • Converting price from rupees to dollars, distance from kilometers to miles
  • Creating dummy variables (converting categorical variables to numerical)
  • Removing outlier values
  • Standardizing the numerical variables

In addition to the features present in the data, feature engineernig generated these new set of features

flight_code : flight code of a particular journey from source to destination

dep_time_cat : Departure time in category (Morning, Afternoon….)

arr_time_cat : Arrival time in category (Afternoon, Evening, Night….)

month : Month of the departure (February, March….)

day : Day of scheduled departure (Monday, Tuesday….)

Weekend: Is the departure day a weekend or not.

time_taken_hours : Duration of flight in hours

time_taken_minutes : Duration of flight in minutes

from_location : Geographical co-ordinates of departure city (Latitude, Longitude)

to_location : Geographical co-ordinates of arrival city (Latitude, Longitude)

distance : Distance of journey in miles

# # functions to preprocess
# #creating buckets for different departure times
# time_category <- function(x){
#   time <- as.integer(strsplit(x,":")[[1]][1])
#   if(time <= 5) {return("mid night")}
#   else if(time >5 & time <=10) {return ("morning")}
#   else if(time >10 & time <=15) {return ("afternoon")}
#   else if(time >15 & time <=20) {return("evening")}
#   else if(time >20){return("night")}
# }
# 
# 
# # co-ordinates of the city
# location <- function(city){
#   if(city == "Delhi")  {return("28.7041, 77.1025")}
#   else if(city == "Mumbai") {return("19.0760, 72.8777")}
#   else if(city == "Bangalore") {return("12.9716, 77.5947")}
#   else if(city == "Kolkata") {return("22.5726, 88.3639")}
#   else if(city == "Hyderabad") {return("17.3850, 78.4867")}
#   else if(city == "Chennai") {return("13.0827, 80.2707")}
# }
# 
# 
# # Finding distance between two co-orinates
# find_distance <- function(x){
#   from_location <- x["from_location"]
#   to_location <- x["to_location"]
# 
#   from_lat <- as.double(strsplit(from_location,", ")[[1]][1])
#   from_long <- as.double(strsplit(from_location,", ")[[1]][2])
#   to_lat <- as.double(strsplit(to_location,", ")[[1]][1])
#   to_long <- as.double(strsplit(to_location,", ")[[1]][2])
# 
#   distance <-round((distm(c(from_long,from_lat),c(to_long,to_lat), fun = distHaversine)[1]/1000),digits = 2)
#   return(distance)
# }
# 
# 
# # Reading the data
# data <- read.csv("C:/Users/LEGION/OneDrive/Desktop/GW/sem_1/intro to Data Science/midterm/data/airline.csv")
# 
# 
# # Feature engineering new columns for the Month, Day, and weekend
# data$month <- months(as.Date(data$date))
# data$day <- weekdays(as.Date(data$date))
# data$weekend <- ifelse(data$day == "Sunday" | data$day == "Saturday", 1, 0)
# 
# # Cleaning the feature price 
# data$price <- gsub(",", "", data$price)
# data$price <- as.integer(data$price/80)
# 
# # combining airline code and flight number to make the flight code
# data$flight_code <- paste(data$ch_code, "-", data$num_code)
# 
# 
# # removing unwanted columns
# data_1 <- subset(data, select = -c(ch_code, num_code))
# data_1$time_taken_hms <- hm(data_1$time_taken)
# 
# # Engineering two new features, duration of flight in hours and in minutes
# data_1$time_taken_minutes <- hour(data_1$time_taken_hms)*60 + minute(data_1$time_taken_hms)
# data_1$time_taken_hours <- round(data_1$time_taken_minutes/60,2)
# 
# data_1 <- subset(data_1, select = -c(time_taken))
# 
# 
# 
# # Categorising the departure and arrrival times
# 
# data_1["dep_time_cat"] <- apply(X = data_1["dep_time"], FUN = time_category, MARGIN = 1)
# data_1["arr_time_cat"] <- apply(X = data_1["arr_time"], FUN = time_category, MARGIN = 1)
# 
# 
# dep_time_1 <- as_hms(strptime(data_1$dep_time,format = "%H:%M"))
# data_1$dep_time <- dep_time_1
# 
# 
# arr_time_1 <- as_hms(strptime(data_1$arr_time,format = "%H:%M"))
# data_1$arr_time <- arr_time_1
# 
# # Adding from and to co-ordinates to the cities
# 
# data_1["from_location"] <- apply(X = data_1["from"], FUN = location, MARGIN = 1)
# data_1["to_location"] <- apply(X = data_1["to"], FUN = location, MARGIN = 1)
# 
# 
# # Adding new feature "distance" between departure and arrival city
# data_1['distance'] <- apply(X = data_1,FUN = find_distance, MARGIN = 1)
# 
# # Rearranging the columns in correct order
# data_1 <- data_1[, c("airline","flight_code","date","from","dep_time","dep_time_cat","to","arr_time","arr_time_cat",
#                  "stop","class","price","month","day","weekend","time_taken_hms","time_taken_hours",
#                  "time_taken_minutes", "from_location","to_location","distance")]
# 
# 
# # Saving the dataframe into a .csv file
# # write.csv(df_1, "C:/Users/LEGION/OneDrive/Desktop/GW/sem_1/intro to Data Science/midterm/data/final_data.csv", row.names=FALSE)
df <-  read.csv("../data/final_data.csv")
df$date <- as.Date(df$date, format =  "%Y-%m-%d")

factor_cols <- c("airline","flight_code","from","to","dep_time_cat","arr_time_cat",
                 "stop","class","month")

for (col in factor_cols){
  df[,col] <- as.factor(df[,col])
}

time_cols <- c("dep_time","arr_time")
for (col in time_cols){
  df[,col] <- chron(times = df[,col])
}


df$day <- factor(df$day, levels= c("Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))


df$airline <- factor(df$airline, levels= c("Indigo","Air India", "GO FIRST", "Vistara", "AirAsia", "SpiceJet","StarAir","Trujet"))



xkabledplyhead(df, title = "Flight data")
Flight data
airline flight_code date from dep_time dep_time_cat to arr_time arr_time_cat stop class price month day weekend time_taken_hms time_taken_hours time_taken_minutes from_location to_location distance
Air India AI - 868 2022-02-11 Delhi 18:00:00 evening Mumbai 20:00:00 evening 0 business 320 February Friday 0 2H 0M 0S 2.00 120 28.7041, 77.1025 19.0760, 72.8777 718
Air India AI - 624 2022-02-11 Delhi 19:00:00 evening Mumbai 21:15:00 night 0 business 320 February Friday 0 2H 15M 0S 2.25 135 28.7041, 77.1025 19.0760, 72.8777 718
Air India AI - 531 2022-02-11 Delhi 20:00:00 evening Mumbai 20:45:00 evening 1 business 528 February Friday 0 24H 45M 0S 24.75 1485 28.7041, 77.1025 19.0760, 72.8777 718
Air India AI - 839 2022-02-11 Delhi 21:25:00 night Mumbai 23:55:00 night 1 business 556 February Friday 0 26H 30M 0S 26.50 1590 28.7041, 77.1025 19.0760, 72.8777 718
Air India AI - 544 2022-02-11 Delhi 17:15:00 evening Mumbai 23:55:00 night 1 business 584 February Friday 0 6H 40M 0S 6.67 400 28.7041, 77.1025 19.0760, 72.8777 718

Summary statistics


A high level overview of the data I am dealing with:

  1. The Departure time ranges from 00:10 AM in the Morning to 23:55 PM in the Night.

  2. And the Arrival time ranges from 00:05 AM in the Morning to 23:59 PM in the Night.

  3. There are 3 possible values for stops i.e 0(nonstop), 1, 2+ stops.

  4. There are two classes of tickets Business and Economy.

  5. Is the day of departure a weekend or not (has values 0 or 1)

  6. Time taken for the journey ranging from 50 minutes to 49.8 Hours

  7. The distance covered by the flight with a minimum of 181 to a maximum of 1100 miles.

#removing duplicated rows

df_1 <- distinct(df, airline,flight_code,date,from,dep_time,to,arr_time,class, .keep_all= TRUE)

EDA

Flight map

#mapping the major cities of india 

Delhi <- c(77.1025,28.7041)
Mumbai <- c(72.8777,19.0760)
Bangalore <- c(77.5947,12.9716)
Kolkata <- c(88.3639,22.5726)
Hyderabad <- c(78.4867,17.3850)
Chennai <- c(80.2707,13.0827)
  

# Data frame
data <- rbind(Delhi, Mumbai,Bangalore, Kolkata, Hyderabad, Chennai) %>% 
  as.data.frame()
colnames(data) <- c("long","lat")

# World map
map('world',
    col="#d5d6d8", fill=TRUE, bg="white", lwd=0.05,
    mar=rep(0,4),border=0,xlim = c(68,98),ylim=c(8,37) )

points(x=data$long, y=data$lat, col="black", cex=1.5, pch=20)


# Generate all pairs of coordinates
all_pairs <- cbind(t(combn(data$long, 2)), t(combn(data$lat, 2))) %>% as.data.frame()
colnames(all_pairs) <- c("long1","long2","lat1","lat2")


# add every connections:
for(i in 1:nrow(all_pairs)){
    plot_my_connection(all_pairs$long1[i], all_pairs$lat1[i], all_pairs$long2[i], 
                       all_pairs$lat2[i], col="black", lwd=1)
    }
 
# add points and names of cities
text(rownames(data), x=data$long, y=data$lat,  col="black", cex=0.5, pos=4)


I am trying to know the factors affecting the flight prices in these major cities of India.

Delhi : The Capital of India

Mumbai: The Financial capital of India

Kolkata: Biggest economy after Mumbai and Delhi

Bangalore: Known as silicon valley of India

Hyderabad: The city of Nizam’s and booming Tech Industry

Chennai: Known as the Detroit of Asia, for having several automobile manufacturing companies


Number of flights scheduled

xform <- list(categoryorder = "array",
              categoryarray = c("Indigo","Air India", "GO FIRST", "Vistara", 
                                   "AirAsia", "SpiceJet","StarAir","Trujet"))

fig <- plot_ly(x = air_line_cout$Var1, y = air_line_cout$Freq, type = 'bar', color = air_line_cout$Var1,
               colors = c("Indigo" = "#1f77b4","Air India" = "#ff7f0f", "GO FIRST" = "#2ca02c", "Vistara" = "#9467bd", 
                                   "AirAsia" = "#d62728", "SpiceJet" = "#8c564b","StarAir" = "#e377c2","Trujet" = "#7f7f7f"))

fig <- fig %>% layout(xaxis = list(title = "Airline",xform), title = "Number of flights scheduled by airlines",
                      yaxis = list(title = "flights scheduled every day"), showlegend = F)

# api_create(fig, filename = "Flight traffic share of each airlines")


fig

Indigo, one of the largest airlines in India has most flight traffic of 414 Scheduled flights everyday between these 6 cities , followed by Air India with 141 flights scheduled (recently acquired by TATA, a multi national conglomerate), and budget airlines like Go First, AirAsia and SpiceJet.

With Least number of flights operated by StarAir and TruJet as they are the latest entries to the airline industry, with just couple of flights in their fleet.


Cities with highest traffic

# number of flights between cities per day
city_freq <- rename(count(df_for_count, from, to), Freq = n)
city_freq <- city_freq[order(-city_freq$Freq),]
city_freq$Freq <- as.integer(city_freq$Freq/49)
row.names(city_freq) <- NULL
xkabledplyhead(city_freq, title = "Daily flight frequencies")
Daily flight frequencies
from to Freq
Delhi Mumbai 168
Mumbai Delhi 139
Bangalore Delhi 113
Kolkata Delhi 88
Delhi Bangalore 62

The Highest traffic is seen between the cities Delhi -> Mumbai , Mumbai -> Delhi and followed by Bangalore -> Delhi.


This Sankey chart shows the flow of flights (flight traffic) in between the cities.

SankeyDiagram(city_freq[, -3],link.color = "Source", 
              weights = city_freq $Freq,label.show.percentages = TRUE,
              variables.share.values = TRUE)


Factors affecting the price

1. Class

density <- density(df$price)

economy <- df[df$class == "economy",]
density_economy <- density(economy$price)
business <- df[df$class == "business",]
density_business <- density(business$price)


fig <- plot_ly(x = ~density_economy$x, y = ~density_economy$y, type = 'scatter', mode = 'lines', 
               name = 'Economy', fill = 'tozeroy')

fig <- fig %>% add_trace(x = ~density_business$x, y = ~density_business$y, name = 'Business', fill = 'tozeroy')

fig <- fig %>% layout(title = "Price distribution of economy and business class",xaxis = list(title = 'Price',range=c(0,1100)),yaxis = list(title = 'Density'))

fig

It is a fact that the passenger experience in business class is far superior than the experience in economy class, it is because of the service, food and the in flight entertainment. This all costs money and it is clearly visible from the plot above that ticket prices of business class lie towards the middle and right-side of the price scale ranging from as low as $200 to as much as $1000+, where as the whole of economy class lies almost to the left of $200.

xform <- list(categoryorder = "array",
              categoryarray = c("Indigo", "Air India","Vistara", "AirAsia","SpiceJet","TruJet"))



fig1 <- plot_ly(x = economy$airline, y = economy$price,type = "box", name = "economy")
fig1 <- fig1 %>% layout(title = "price range of airlines for economy and business class",
                        xaxis = list(title = "Airline", showline = T, linewidth = 0.5),
                      yaxis = list(title = "Price", showline = T, linewidth = 1,range=c(0,250)))

fig2 <- plot_ly(x = business$airline, y = business$price,type = "box", name = "business")
fig2 <- fig2 %>% layout(title = "price range of airlines for economy and business class",
                         xaxis = list(title = "Airline", showline = T, linewidth = 0.5),
                      yaxis = list(title = "price", showline = T, linewidth = 1,range=c(100,1100)))

fig2 <- fig2 %>% layout(xaxis = xform, yaxis = list(side = "right"))



fig <- subplot(fig1, fig2,shareX = T) %>% layout(title = 'Price range of business and economy tickets')

fig

In the Domestic market of india only Air India (India’s first private airline) and Vistara (Joint collaboration of singapore airlines and TATA) operate with Business class offerings.

The remaining carriers operate only within the economy segment with average prices ranging from around $30 to $100 which might feel insignificant to us, but it still is a big price to pay for people in India considering their average incomes.

It is also clear from the plot that prices of business and economy classes are significantly different from each other.


conducted an ANOVA test to see if class of the seat affect the price of ticket.

Null Hypothesis H0: There is no difference in mean prices of Business and Economy class seats

Alternate Hypothesis H1: The mean prices of Business and Economy seats are significantly different

i.e. Class of the seat has impact on the price.

anova_class = aov(price ~ class, data=df)
# print(summary(anova_class))
xkabledply(anova_class, title = "ANOVA result summary for class")
ANOVA result summary for class
Df Sum Sq Mean Sq F value Pr(>F)
class 1e+00 2.13e+10 2.13e+10 2193516 0
Residuals 3e+05 2.91e+09 9.69e+03 NA NA

With the P-value = e-16, I reject the Null Hypothesis and accept the Alternate Hypothesis that, the prices of Business and Economy classes are significantly different. Thus, class of seats impact the price of the ticket.

On an average, the price of Business class seats are 8 times more expensive than the economy seats.


2. COVID

2X increase in flights

fig <- plot_ly(date_count, type = 'scatter', mode = 'lines')%>%
  add_trace(x = ~date, y = ~flight_frequency,name = "flight departures",line = list(color = "1f77b4"),showlegend = FALSE)

fig <- fig %>% layout(title = "Flight frequency from Feb-11 to Mar-31")
options(warn = -1)


fig

It is easy to observe that during mid-February to late-March, flight frequency increased by double, from 513 to over 1000. So, it is obvious to see that flight frequency increased dramatically during this time.

vline <- function(x = 0, color = "black") {
  list(
    type = "line",
    y0 = 0,
    y1 = 1,
    yref = "paper",
    x0 = x,
    x1 = x,
    line = list(color = color)
  )
}


fig <- plot_ly()
# Add traces
fig <- fig %>% add_trace(x = ~covid_df$date, y = ~covid_df$new_cases, name = "covid cases", yaxis = "y2", mode = "lines", type = "scatter",fill = "tozeroy",fillcolor = "rgba(214,39,40,0.2)", line = list(color="d62728"))

ay <- list(
  tickfont = list(color = "black"),
  overlaying = "y",
  side = "left",
  title = "# covid +ve cases")

fig <- fig %>% add_trace(x = ~date_count$date, y = ~date_count$flight_frequency, name = "flights", mode = "lines", type = "scatter",line = list(color="1f77b4"))

# Set figure title, x and y-axes titles
fig <- fig %>% layout(
  title = "Effect of COVID and travel restrictions on flight frequency", yaxis2 = ay,
  xaxis = list(title="Date"),
  yaxis = list(side = "right",title="Flight frequency")
)%>%
  layout(plot_bgcolor='white',
          xaxis = list(
            zerolinecolor = '#ffff',
            zerolinewidth = 2,
            gridcolor = 'ffff'),
          yaxis = list(
            zerolinecolor = '#ffff',
            zerolinewidth = 2,
            gridcolor = 'ffff')
          )


fig <- fig %>% add_annotations(text = "COVID Restrictions relaxed", x = as.Date("2022-02-14"), y = 650,showarrow = T, ax = 85,ay = -70)

fig <- fig %>% add_annotations(text = "COVID 3rd wave in India", x = as.Date("2022-01-21"), y = 600, showarrow = F)

fig <- fig %>% layout(shapes = list(vline(as.Date("2022-02-14"))))

fig

The reasons are due to the COVID cases decreasing and relaxation of travel restrictions in India.
I downloaded the COVID dataset from Our World in data. I found an interesting relation between the number of COVID cases per day and flight frequency. We can observe from the plot, that COVID cases reached maximum of 347.2K, on January 20 and decreased to 50.4K by February 11. Besides, COVID travel restrictions were relaxed on February 14.


The drop in price

vline <- function(x = 0, color = "black") {
  list(
    type = "line",
    y0 = 0,
    y1 = 1,
    yref = "paper",
    x0 = x,
    x1 = x,
    line = list(color = color)
  )
}


fig <- plot_ly(price_df, x = ~date) 
fig <- fig %>% add_trace(y = ~avg_price, type = "scatter",name = '0 stops',mode = 'lines',
                         hovertemplate = paste('<b>Date</b>: %{x}<br>','<b>Average Price</b>: %{y:.2f}')) 

fig <- fig %>% add_annotations(text = "Covid Restrictions relaxed", x = as.Date("2022-02-14"), y = 100,showarrow = T, ax = 85,ay = 45)

fig <- fig %>% layout(shapes = list(vline(as.Date("2022-02-14"))), title = "Average Price vs Date",
                      xaxis = list(title = "Date"),yaxis = list(title = "Average Price")) 

fig

As for the consequence of increasing flight frequency, the prices of flight tickets dropped by about one third, from $182.66 to $58.37, within period mentioned.


I conducted a Two Sample T-Test to see if this price drop is significant enough.

Null Hypothesis H0 : There is no difference in price before and after travel restrictions were relaxed

Alternate Hypothesis H1 : There is a significant difference in the price before and after the travel restrictions were relaxed

with_restrictions <- df[df$date <= as.Date("2022-02-14"),]
restrictions_relaxed <- df[df$date > as.Date("2022-02-14"),]

ttest_covid_price = t.test(with_restrictions$price,restrictions_relaxed$price)
print(ttest_covid_price)
## 
##  Welch Two Sample t-test
## 
## data:  with_restrictions$price and restrictions_relaxed$price
## t = 34, df = 16821, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  79.4 89.0
## sample estimates:
## mean of x mean of y 
##       341       257

With the P-value = 2e-16, I successfully reject the Null Hypothesis that, there is a significant difference in the price before and after the travel restrictions were relaxed. Therefore, relaxation of travel restrictions have a significant impact on the price.


3. Stops

As The flight frequency between Delhi(National capital) and Mumbai(Financial capital) is high, I will consider these flights for further analysis

Effect of total stops on price

fig <- plot_ly(x = dm$stop, y = dm$price,type = "box",color = dm$stop)
fig <- fig %>% layout(title = "price range of flights based on number of stops",
                        xaxis = list(title = "Stops", showline = T, linewidth = 0.5),
                      yaxis = list(title = "Price", showline = T, linewidth = 1,range=c(0,180)),
                      showlegend = F)

fig

I observe that flights with 1 or more stops have higher prices than non stop flights.

Every time a plane lands to an airport, there is more things to consider than just to safely land: it pays a fee to airports, refuel the plane, restock snacks and drinks on the plane, etc., which all cost money. This has greater impact on domestic than the international flights (which operate on different business model).


conducted a ANOVA test to see if number of stops affect the price.

Null Hypothesis H0 : The mean of flight prices with different number of stops are same

Alternate Hypothesis H1 : The mean of flight prices with different number of stops are significantly different.

i.e. Number of stops affect the price of the flight ticket.

anova_class = aov(price ~ stop, data=dm)
#summary(anova_class)
xkabledply(anova_class, title = "ANOVA test summary for stop")
ANOVA test summary for stop
Df Sum Sq Mean Sq F value Pr(>F)
stop 2 5454053 2727026 1596 0
Residuals 14847 25376149 1709 NA NA

With the P-value = 2e-16, reject the Null Hypothesis that, there is significant difference in flight price mean with different number of stops. Thus, I am confident that the number of stops affect the price of the flight ticket.


4. Day and time of departure

Flight frequency over the day

fig <- plot_ly(time_count, type = 'scatter', mode = 'lines')%>%
  add_trace(x = ~hour, y = ~flight_frequency,name = "",line = list(color = "1f77b4"),showlegend = FALSE,
                hovertemplate = paste('<b>Hour</b>: %{x}<br>','<b>flight frequency</b>: %{y:.2f}'))

fig <- fig %>% layout(title = "Average flight frequency over 24 hours of the day",
                      xaxis = list(title = "Hour"), yaxis = list(title = "Flight frequency"))
options(warn = -1)

fig <- fig %>% layout(shapes = list( list(type = "rect",line = list(color = "rgba(42,34,94,0.2)"),
                                        x0 = 0, x1 = 4.99,y0 = 1,y1 = 80,fillcolor = "rgba(42,34,94,0.2)"),
                                     list(type = "rect",line = list(color = "rgba(242,145,2,0.2)"),
                                        x0 = 5, x1 = 9.99,y0 = 1,y1 = 80,fillcolor = "rgba(242,145,2,0.2)"),
                                     list(type = "rect",line = list(color = "rgba(253,234,16,0.2)"),
                                        x0 = 10, x1 = 14.99,y0 = 1,y1 = 80,fillcolor = "rgba(253,234,16,0.2)"),
                                     list(type = "rect",line = list(color = "rgba(147,192,27,0.2)"),
                                        x0 = 15, x1 = 19.99,y0 = 1,y1 = 80,fillcolor = "rgba(147,192,27,0.2)"),
                                     list(type = "rect",line = list(color = "rgba(48,128,201,0.2)"),
                                        x0 = 20, x1 = 24,y0 = 1,y1 = 80,fillcolor = "rgba(48,128,201,0.2)")), plot_bgcolor = "white")

fig <- fig %>% add_annotations(text = "Mid Night", x = 2.5,y = 65,showarrow = F) %>%
                add_annotations(text = "Morning", x = 7.5, y = 35,showarrow = F) %>%
                add_annotations(text = "Afternoon", x = 12.5,y = 35,showarrow = F) %>%
                add_annotations(text = "Evening", x = 17.5, y = 35,showarrow = F) %>%
                add_annotations(text = "Night", x = 22,y = 65,showarrow = F)


fig

Flight frequency is high in the morning and evening when compared to other time categories.

This is predictable, as people often prefer to travel to a destination, finish the work in hand, and return back home by the end of the day.

dm_over_week <- data.frame(dm %>% group_by(day) %>% summarise(average_price = mean(price)))

The weekend demand

fig <- plot_ly(dm_over_week, type = 'scatter', mode = 'lines')%>%
  add_trace(x = ~day, y = ~average_price,name = "",line = list(color = "1f77b4"),showlegend = FALSE,
            hovertemplate = paste('<b>Day</b>: %{x}<br>','<b>Average Price</b>: %{y:.2f}')) %>%
                layout(title = 'Average ticket price of economy seats over the week', plot_bgcolor = "white", xaxis = list(title = 'Day'), 
                          yaxis = list(title = 'Price'))



fig

As the demand for travel increases on weekends, there is a clear trend of price rise as we get closer to weekend, with maximum prices seen on Saturday and Sunday.

The lowest price is seen on Tuesday’s.


Best time to fly

From the previous plot I concluded that Tuesday offers the lowest price, now it is time to analyse what time on Tuesday offers the lowest price.

dm_friday <- dm[dm$day == "Friday",]
dm_friday$hour <- apply(X = dm_friday, FUN = find_hr, MARGIN = 1)

dm_over_day <- data.frame(dm_friday %>% group_by(hour) %>% summarise(average_price = mean(price)))

# head(dm_over_day)

fig <- plot_ly(dm_over_day, type = 'scatter', mode = 'lines')%>%
  add_trace(x = ~hour, y = ~average_price,name = "",line = list(color = "1f77b4"),showlegend = FALSE,
            hovertemplate = paste('<b>Hour</b>: %{x}<br>','<b>Average Price</b>: %{y:.2f}')) %>%
                layout(title = 'Average price of economy seats over tuesday', plot_bgcolor = "white", xaxis = list(title = 'Hour'), 
                          yaxis = list(range = c(25,110),title = 'Price'))


fig <- fig %>% layout(shapes = list( list(type = "rect",line = list(color = "rgba(42,34,94,0.2)"),
                                        x0 = 0, x1 = 4.99,y0 = 1,y0 = 25,y1 = 115,fillcolor = "rgba(42,34,94,0.2)"),
                                     list(type = "rect",line = list(color = "rgba(242,145,2,0.2)"),
                                        x0 = 5, x1 = 9.99,y0 = 1,y0 = 25,y1 = 115,fillcolor = "rgba(242,145,2,0.2)"),
                                     list(type = "rect",line = list(color = "rgba(253,234,16,0.2)"),
                                        x0 = 10, x1 = 14.99,y0 = 1,y0 = 25,y1 = 115,fillcolor = "rgba(253,234,16,0.2)"),
                                     list(type = "rect",line = list(color = "rgba(147,192,27,0.2)"),
                                        x0 = 15, x1 = 19.99,y0 = 1,y0 = 25,y1 = 115,fillcolor = "rgba(147,192,27,0.2)"),
                                     list(type = "rect",line = list(color = "rgba(48,128,201,0.2)"),
                                        x0 = 20, x1 = 24,y0 = 1,y0 = 25,y1 = 115,fillcolor = "rgba(48,128,201,0.2)")), plot_bgcolor = "white")

fig <- fig %>% add_annotations(text = "Mid Night", x = 2.5,y = 95,showarrow = F) %>%
                add_annotations(text = "Morning", x = 7.5, y = 55,showarrow = F) %>%
                add_annotations(text = "Afternoon", x = 12.5,y = 55,showarrow = F) %>%
                add_annotations(text = "Evening", x = 17.5, y = 55,showarrow = F) %>%
                add_annotations(text = "Night", x = 22,y = 95,showarrow = F)



fig

The price of flights is low during the odd hours of Night and Mid Night.

Considering the prices on different days and by hour, it is recommended to fly on Tuesday’s and in odd hours in between 10:00 Pm and 4:00 AM.


Model Building

df = read.csv("../data/Final_data_with_Covid_cases.csv")


df$hour <- apply(X = df, FUN = find_hr, MARGIN = 1)

df <- df %>% mutate(stop = recode(stop, "0" = 'non_stop', "1" = '1_stop', "2+" =  '2_or_more' ))


df$weekend <- with(df, factor(weekend, levels = c(1, 0), labels = c("Yes", "No")))


xkabledplyhead(df)
Head
airline flight_code date from dep_time dep_time_cat to arr_time arr_time_cat stop class month day weekend time_taken_hms time_taken_hours time_taken_minutes from_location to_location distance Daily_covid_cases price hour
Air India AI - 868 11-02-2022 Delhi 18:00:00 evening Mumbai 20:00:00 evening non_stop business February Friday No 2H 0M 0S 2.00 120 28.7041, 77.1025 19.0760, 72.8777 718 50407 320 18
Air India AI - 624 11-02-2022 Delhi 19:00:00 evening Mumbai 21:15:00 night non_stop business February Friday No 2H 15M 0S 2.25 135 28.7041, 77.1025 19.0760, 72.8777 718 50407 320 19
Air India AI - 531 11-02-2022 Delhi 20:00:00 evening Mumbai 20:45:00 evening 1_stop business February Friday No 24H 45M 0S 24.75 1485 28.7041, 77.1025 19.0760, 72.8777 718 50407 528 20
Air India AI - 839 11-02-2022 Delhi 21:25:00 night Mumbai 23:55:00 night 1_stop business February Friday No 26H 30M 0S 26.50 1590 28.7041, 77.1025 19.0760, 72.8777 718 50407 556 21
Air India AI - 544 11-02-2022 Delhi 17:15:00 evening Mumbai 23:55:00 night 1_stop business February Friday No 6H 40M 0S 6.67 400 28.7041, 77.1025 19.0760, 72.8777 718 50407 584 17

Before we started building models, I dropped few unwanted columns:

  1. flight code (as it has a lot of classes)

  2. date (instead I am using month and date)

  3. Departure and arrival time (instead I am using departure and arrival time in category)

  4. time_taken_in_minutes

  5. from_location and to_location

cols_drop = c("flight_code", "date", "dep_time", "arr_time", "time_taken_hms", "time_taken_minutes", "from_location", "to_location")

df_2 <- df[, !colnames(df) %in% cols_drop]

xkabledplyhead(df_2, title = "Dataframe after removing unwanted columns")
Dataframe after removing unwanted columns
airline from dep_time_cat to arr_time_cat stop class month day weekend time_taken_hours distance Daily_covid_cases price hour
Air India Delhi evening Mumbai evening non_stop business February Friday No 2.00 718 50407 320 18
Air India Delhi evening Mumbai night non_stop business February Friday No 2.25 718 50407 320 19
Air India Delhi evening Mumbai evening 1_stop business February Friday No 24.75 718 50407 528 20
Air India Delhi night Mumbai night 1_stop business February Friday No 26.50 718 50407 556 21
Air India Delhi evening Mumbai night 1_stop business February Friday No 6.67 718 50407 584 17


Removing outliers

After dropping unwanted columns, removed outliers after grouping the data based on class of the ticket.


Reasoning:

There is a huge difference between the prices of tickets of Business class and Economy class passengers, this is why I separated the data into two different groups and handled the outliers in them, which gives us more control on how the outliers are removed.

# grouping the data based on class and replacing the outlier values by NA and subsiquently removing them.

df_cleaned <- df_2 %>% group_by(class) %>% 
            mutate(price_cleaned= ifelse(price>quantile(price, 0.75, na.rm = T)+1.5*IQR(price, na.rm = T) |
                                           price<quantile(price, 0.25, na.rm = T)-1.5*IQR(price, na.rm = T), NA, price))



# colSums(is.na(df_cleaned)) 
# print(dim(df_cleaned))



df_cleaned <- df_cleaned %>% drop_na()
df_cleaned <-  df_cleaned[, colnames(df_cleaned)[colnames(df_cleaned) != 'price_cleaned']]
# print(dim(df_cleaned))



xkabledplyhead(df_cleaned, title = "Cleaned dataframe")
Cleaned dataframe
airline from dep_time_cat to arr_time_cat stop class month day weekend time_taken_hours distance Daily_covid_cases price hour
Air India Delhi evening Mumbai evening non_stop business February Friday No 2.00 718 50407 320 18
Air India Delhi evening Mumbai night non_stop business February Friday No 2.25 718 50407 320 19
Air India Delhi evening Mumbai evening 1_stop business February Friday No 24.75 718 50407 528 20
Air India Delhi night Mumbai night 1_stop business February Friday No 26.50 718 50407 556 21
Air India Delhi evening Mumbai night 1_stop business February Friday No 6.67 718 50407 584 17

17399 outlier observations are removed from the data.


Preprocessing

I did some additional prepocessing of the data, by separating numerical and categorical columns to deal with them appropriately.

For Numerical features

  1. Reduced the skewness of price with log transformation
  2. Scaled and Standardized numerical variables with mean of 0 and standard deviation of 1
numericVarNames <-  c('time_taken_hours', 'distance', 'Daily_covid_cases', "hour")
DFnumeric <- df_cleaned[, names(df_cleaned) %in% numericVarNames]
DFfactors <- df_cleaned[, !(names(df_cleaned) %in% numericVarNames)]
DFfactors <- DFfactors[, names(DFfactors) != 'price']

cat('There are', length(DFnumeric), 'numeric variables, and', length(DFfactors), 'factor variables')
## There are 4 numeric variables, and 10 factor variables

Skewness is a measure of the symmetry in a distribution. A symmetrical dataset will have a skewness equal to 0. So, a normal distribution will have a skewness of 0. Skewness essentially measures the relative size of the two tails. As a rule of thumb, skewness should be between -1 and 1. In this range, data are considered fairly symmetrical. In order to fix the skewness, I am taking the log for all numeric predictors with an absolute skew greater than 0.8 (actually: log+1, to avoid division by zero issues).


For Categorical features

  1. Converted them to Numerical by One-Hot-Encoding (creating Dummies)
#adding log(feature + 1) to remove skewness

for(i in 1:ncol(DFnumeric)){
        if (abs(skewness(DFnumeric[,i]))>0.8){
                DFnumeric[,i] <- log(DFnumeric[,i] +1)
        }
}

# standardizing the data
PreNum <- preProcess(DFnumeric, method=c("center", "scale"))
DFnorm <- predict(PreNum, DFnumeric)

This is done as XGBoost model cannot interpret the categorical columns, and features are supposed to be in a numerical data type.

DFdummies <- as.data.frame(model.matrix(~.-1, DFfactors))
xkabledplyhead(DFdummies, title = "Data frame after creating dummies")
Data frame after creating dummies
airlineAir India airlineAirAsia airlineGO FIRST airlineIndigo airlineSpiceJet airlineStarAir airlineTrujet airlineVistara fromChennai fromDelhi fromHyderabad fromKolkata fromMumbai dep_time_catearly morning dep_time_catevening dep_time_catmid night dep_time_catmorning dep_time_catnight toChennai toDelhi toHyderabad toKolkata toMumbai arr_time_catearly morning arr_time_catevening arr_time_catmid night arr_time_catmorning arr_time_catnight stop2_or_more stopnon_stop classeconomy monthMarch dayMonday daySaturday daySunday dayThursday dayTuesday dayWednesday weekendNo
1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1


After the additional preprocessing was done, One-Hot-Encoding resulted in 39 columns from 10 factor features. Combining both the numerical and categorical features back together to create a final dataframe.

combined <- cbind(DFnorm, DFdummies, price = df_cleaned$price)
xkabledplyhead(combined, title = "Final dataframe after preprocessing numerical and categorical features")
Final dataframe after preprocessing numerical and categorical features
time_taken_hours distance Daily_covid_cases hour airlineAir India airlineAirAsia airlineGO FIRST airlineIndigo airlineSpiceJet airlineStarAir airlineTrujet airlineVistara fromChennai fromDelhi fromHyderabad fromKolkata fromMumbai dep_time_catearly morning dep_time_catevening dep_time_catmid night dep_time_catmorning dep_time_catnight toChennai toDelhi toHyderabad toKolkata toMumbai arr_time_catearly morning arr_time_catevening arr_time_catmid night arr_time_catmorning arr_time_catnight stop2_or_more stopnon_stop classeconomy monthMarch dayMonday daySaturday daySunday dayThursday dayTuesday dayWednesday weekendNo price
-1.429 -0.0557 2.39 0.927 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 320
-1.394 -0.0557 2.39 1.113 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 320
1.772 -0.0557 2.39 1.298 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 528
2.019 -0.0557 2.39 1.484 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 556
-0.772 -0.0557 2.39 0.741 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 584


Splitting data into train and test

This helps us understand how well the model has learned on train data, by looking at the performance metrics on the test dataset. I split the dataset to 70% train data, and 30% test data. Also handled skewness on the price column after splitting the data into train and test data.

#make this example reproducible
set.seed(1)

#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(combined), replace=TRUE, prob=c(0.7,0.3))
train  <- combined[sample, ]
test   <- combined[!sample, ]

x_train <- train[, colnames(train)[colnames(train) != 'price']]
y_train <- train$price
  
x_test <- test[, colnames(test)[colnames(test) != 'price']]
y_test <- test$price


# copying the actual y labels before transforming them (log transform), for future model evaluation
y_train_actuals <- duplicate(y_train)
y_test_actuals <- duplicate(y_test)


# removing skewness by log tranformation
cat("Skewness of train price before transformation", skewness(y_train_actuals))
## Skewness of train price before transformation 0.988
y_train <- log(y_train) #default is the natural logarithm, "+1" is not necessary as there are no 0's
cat("\nSkewness of train price after transformation", skewness(y_train))
## 
## Skewness of train price after transformation 0.461
cat("\n\nSkewness of test price before transformation", skewness(y_test_actuals))
## 
## 
## Skewness of test price before transformation 0.98
y_test <- log(y_test) 
cat("\nSkewness of test price after transformation", skewness(y_test))
## 
## Skewness of test price after transformation 0.455


After splitting the dataset, I have 198023 observations in train and 84839 observations in test. I also have reduced the Skewness of price in both train and test data sets by almost 50%.


Linear regression on single variables

First, built linear regression models with single variable based on their impact on price found during the EDA.

  1. Class

  2. Covid

  3. Stops

  4. Day and time of departure

Performance metrics used:

  1. RMSE (Root Mean Squared Error)

As the dependent variable (Price) is of numerical data type, RMSE helps us understand and compare the performance of different models use,

  1. MAPE (Mean Absolute Percentage Error)

For any one not fully into Data, it is difficult for them to understand what RMSE actually represents, where as MAPE is easily understandable, as if I predict something, how much percentage of error of the predicted value can they expect. This makes it easy for people of non data science back ground to understand the model performance better.


Class

model_class = lm(price ~ classeconomy, data = train)
print(summary(model_class))
## 
## Call:
## lm(formula = price ~ classeconomy, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -379.0  -26.1   -3.2   27.5  379.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   660.026      0.335    1970   <2e-16 ***
## classeconomy -586.679      0.406   -1446   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84 on 198021 degrees of freedom
## Multiple R-squared:  0.914,  Adjusted R-squared:  0.914 
## F-statistic: 2.09e+06 on 1 and 198021 DF,  p-value: <2e-16
y_test_class_predictions <- predict(model_class, x_test)


Observations

The Performance of the Linear regression model based on the predictor Class has achieved an Adjusted R^2 of 0.914, which is a decent score considering only one variable is used.

Test RMSE,(Root Mean Squared Error) : 83.7

Test MAPE,(Mean Absolute Percentage Error) : 36.581% (model predicts with average error percentage of 36.581%)

Given only one variable is being used to predict the price of the ticket, MAPE of 36.6% can still be considered pretty good.


Covid

model_covid = lm(price ~ Daily_covid_cases, data = train)
print(summary(model_covid))
## 
## Call:
## lm(formula = price ~ Daily_covid_cases, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -306   -196   -166    288    817 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        259.532      0.638     407   <2e-16 ***
## Daily_covid_cases   31.267      0.638      49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 284 on 198021 degrees of freedom
## Multiple R-squared:  0.012,  Adjusted R-squared:  0.012 
## F-statistic: 2.4e+03 on 1 and 198021 DF,  p-value: <2e-16
y_test_covid_predictions <- predict(model_covid, x_test)


Observations

The univariate variate model with daily covid cases does not provide with satisfactory results. The model built with daily covid cases (which showed a strong effect on price during EDA) does not have enough predictive power with Ajdusted R^2 value of 0.012.

Test RMSE,(Root Mean Squared Error) : 285.078

Test MAPE,(Mean Absolute Percentage Error) : 240.583% (model predicts with average error percentage of 240.583%)

The performance of the model is really poor with an error percentage of 241%, which is not at all acceptable.


Stops

model_stop = lm(price ~ stop2_or_more   + stopnon_stop, data = train)
print(summary(model_stop))
## 
## Call:
## lm(formula = price ~ stop2_or_more + stopnon_stop, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -272   -218   -152    280    875 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    285.591      0.683   417.9   <2e-16 ***
## stop2_or_more -128.458      3.297   -39.0   <2e-16 ***
## stopnon_stop  -177.171      1.938   -91.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 279 on 198020 degrees of freedom
## Multiple R-squared:  0.0454, Adjusted R-squared:  0.0453 
## F-statistic: 4.7e+03 on 2 and 198020 DF,  p-value: <2e-16
y_test_stop_predictions <- predict(model_covid, x_test)


Observations

Though the variable shows importance, but the Adjusted R^2 is quite low 0.0453.

Test RMSE,(Root Mean Squared Error) : 285.078

Test MAPE,(Mean Absolute Percentage Error) : 240.583% (model predicts with average error percentage of 240.583%)

Again, the performance of the model is really poor with an error percentage of 241%.


day and time

model_dt = lm(price ~ hour  + dayMonday + dayTuesday + dayWednesday + dayThursday + daySaturday + daySunday, data = train)
print(summary(model_dt))
## 
## Call:
## lm(formula = price ~ hour + dayMonday + dayTuesday + dayWednesday + 
##     dayThursday + daySaturday + daySunday, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -264   -200   -170    293    792 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   259.595      1.766  147.00   <2e-16 ***
## hour            9.306      0.642   14.50   <2e-16 ***
## dayMonday      -1.261      2.426   -0.52   0.6030    
## dayTuesday     -3.918      2.428   -1.61   0.1065    
## dayWednesday   -3.327      2.428   -1.37   0.1707    
## dayThursday    -3.175      2.429   -1.31   0.1911    
## daySaturday     7.317      2.480    2.95   0.0032 ** 
## daySunday       4.859      2.468    1.97   0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286 on 198015 degrees of freedom
## Multiple R-squared:  0.00126,    Adjusted R-squared:  0.00123 
## F-statistic: 35.7 on 7 and 198015 DF,  p-value: <2e-16
y_test_dt_predictions <- predict(model_covid, x_test)


Observations

Again, here the Adjusted R^2 is quite low with 0.00123.

Test RMSE,(Root Mean Squared Error) : 285.078

Test MAPE,(Mean Absolute Percentage Error) : 240.583% (model predicts with average error percentage of 240.583%)


Conclusion : univariate linear models

From the models built above, I can conclude that the performance of the models are poor, except for the feature Class with Adjusted R^2 of 0.914 and RMSE of 83.7 and MAPE of 36.6. The price is related to the features in a more complex way and it needs more than just one variable to predict with good performance metric.


Building multivariate models

Base Linear Regression Model

I built a base Linear regression model with all the features in our dataset.

#make this example reproducible
set.seed(1)


model = lm(price ~ ., data = train)
print(summary(model))
## 
## Call:
## lm(formula = price ~ ., data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -368.4  -31.5   -1.6   32.7  378.1 
## 
## Coefficients: (2 not defined because of singularities)
##                             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                  685.902      1.107   619.45  < 2e-16 ***
## time_taken_hours               1.257      0.228     5.52  3.3e-08 ***
## distance                       1.119      0.265     4.23  2.3e-05 ***
## Daily_covid_cases             11.553      0.316    36.54  < 2e-16 ***
## hour                           7.505      0.833     9.01  < 2e-16 ***
## `airlineAir India`           -45.246      0.423  -106.99  < 2e-16 ***
## airlineAirAsia               -46.989      0.818   -57.47  < 2e-16 ***
## `airlineGO FIRST`            -25.856      0.699   -36.99  < 2e-16 ***
## airlineIndigo                -24.782      0.591   -41.96  < 2e-16 ***
## airlineSpiceJet              -20.042      1.020   -19.64  < 2e-16 ***
## airlineStarAir                10.794     11.828     0.91  0.36146    
## airlineTrujet                -57.300     13.952    -4.11  4.0e-05 ***
## airlineVistara                    NA         NA       NA       NA    
## fromChennai                   -4.919      0.620    -7.93  2.1e-15 ***
## fromDelhi                    -22.129      0.635   -34.87  < 2e-16 ***
## fromHyderabad                -20.057      0.637   -31.49  < 2e-16 ***
## fromKolkata                   12.058      0.661    18.25  < 2e-16 ***
## fromMumbai                    -0.593      0.563    -1.05  0.29206    
## `dep_time_catearly morning`   18.977      1.302    14.58  < 2e-16 ***
## dep_time_catevening            3.596      0.887     4.05  5.1e-05 ***
## `dep_time_catmid night`       28.263      2.286    12.36  < 2e-16 ***
## dep_time_catmorning           15.645      0.866    18.08  < 2e-16 ***
## dep_time_catnight             -4.665      1.351    -3.45  0.00055 ***
## toChennai                     -4.876      0.618    -7.89  3.1e-15 ***
## toDelhi                      -23.232      0.648   -35.88  < 2e-16 ***
## toHyderabad                  -20.013      0.634   -31.56  < 2e-16 ***
## toKolkata                     13.439      0.643    20.89  < 2e-16 ***
## toMumbai                       2.685      0.571     4.71  2.5e-06 ***
## `arr_time_catearly morning`  -11.960      0.891   -13.42  < 2e-16 ***
## arr_time_catevening            8.786      0.507    17.33  < 2e-16 ***
## `arr_time_catmid night`        9.326      0.883    10.56  < 2e-16 ***
## arr_time_catmorning            3.909      0.549     7.12  1.1e-12 ***
## arr_time_catnight              9.026      0.537    16.79  < 2e-16 ***
## stop2_or_more                 19.268      0.895    21.54  < 2e-16 ***
## stopnon_stop                 -73.553      0.620  -118.62  < 2e-16 ***
## classeconomy                -570.866      0.407 -1403.73  < 2e-16 ***
## monthMarch                   -12.589      0.689   -18.26  < 2e-16 ***
## dayMonday                     -2.482      0.635    -3.91  9.4e-05 ***
## daySaturday                   -1.013      0.641    -1.58  0.11428    
## daySunday                      2.407      0.644     3.74  0.00018 ***
## dayThursday                   -1.453      0.628    -2.31  0.02063 *  
## dayTuesday                    -3.010      0.628    -4.79  1.6e-06 ***
## dayWednesday                  -2.141      0.628    -3.41  0.00065 ***
## weekendNo                         NA         NA       NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.7 on 197981 degrees of freedom
## Multiple R-squared:  0.933,  Adjusted R-squared:  0.933 
## F-statistic: 6.77e+04 on 41 and 197981 DF,  p-value: <2e-16


As a result, features in our dataset has given a good results with Adjusted R^2 of 0.933.

y_pred = predict(model, test)

Test RMSE,(Root Mean Squared Error) : 73.576

Test MAPE,(Mean Absolute Percentage Error) : 37.907% (model predicts with average error of 37.907%)

With the MAPE value of 38%, this baseline linear model is considered pretty good.


Feature selection for Linear regression

I performed Feature selection among the variables and have in our dataset to come up with a better linear regression than the baseline model.

performed all three different methods, Adjusted R^2, BIC and Cp.

reg <- regsubsets(price~., data = train, nvmax = 5, nbest = 2, method = "exhaustive") 
## Reordering variables and trying again:
plot(reg, scale = "adjr2", main = "Adjusted R^2")

plot(reg, scale = "bic", main = "BIC")

plot(reg, scale = "Cp", main = "Cp")

summary(reg)
## Subset selection object
## Call: regsubsets.formula(price ~ ., data = train, nvmax = 5, nbest = 2, 
##     method = "exhaustive")
## 43 Variables  (and intercept)
##                             Forced in Forced out
## time_taken_hours                FALSE      FALSE
## distance                        FALSE      FALSE
## Daily_covid_cases               FALSE      FALSE
## hour                            FALSE      FALSE
## `airlineAir India`              FALSE      FALSE
## airlineAirAsia                  FALSE      FALSE
## `airlineGO FIRST`               FALSE      FALSE
## airlineIndigo                   FALSE      FALSE
## airlineSpiceJet                 FALSE      FALSE
## airlineStarAir                  FALSE      FALSE
## airlineTrujet                   FALSE      FALSE
## fromChennai                     FALSE      FALSE
## fromDelhi                       FALSE      FALSE
## fromHyderabad                   FALSE      FALSE
## fromKolkata                     FALSE      FALSE
## fromMumbai                      FALSE      FALSE
## `dep_time_catearly morning`     FALSE      FALSE
## dep_time_catevening             FALSE      FALSE
## `dep_time_catmid night`         FALSE      FALSE
## dep_time_catmorning             FALSE      FALSE
## dep_time_catnight               FALSE      FALSE
## toChennai                       FALSE      FALSE
## toDelhi                         FALSE      FALSE
## toHyderabad                     FALSE      FALSE
## toKolkata                       FALSE      FALSE
## toMumbai                        FALSE      FALSE
## `arr_time_catearly morning`     FALSE      FALSE
## arr_time_catevening             FALSE      FALSE
## `arr_time_catmid night`         FALSE      FALSE
## arr_time_catmorning             FALSE      FALSE
## arr_time_catnight               FALSE      FALSE
## stop2_or_more                   FALSE      FALSE
## stopnon_stop                    FALSE      FALSE
## classeconomy                    FALSE      FALSE
## monthMarch                      FALSE      FALSE
## dayMonday                       FALSE      FALSE
## daySaturday                     FALSE      FALSE
## daySunday                       FALSE      FALSE
## dayThursday                     FALSE      FALSE
## dayTuesday                      FALSE      FALSE
## dayWednesday                    FALSE      FALSE
## airlineVistara                  FALSE      FALSE
## weekendNo                       FALSE      FALSE
## 2 subsets of each size up to 6
## Selection Algorithm: exhaustive
##          time_taken_hours distance Daily_covid_cases hour `airlineAir India`
## 1  ( 1 ) " "              " "      " "               " "  " "               
## 1  ( 2 ) " "              " "      " "               " "  " "               
## 2  ( 1 ) " "              " "      " "               " "  " "               
## 2  ( 2 ) " "              " "      " "               " "  " "               
## 3  ( 1 ) " "              " "      " "               " "  " "               
## 3  ( 2 ) " "              " "      "*"               " "  " "               
## 4  ( 1 ) " "              " "      "*"               " "  " "               
## 4  ( 2 ) " "              " "      " "               " "  " "               
## 5  ( 1 ) " "              " "      "*"               " "  " "               
## 5  ( 2 ) " "              " "      "*"               " "  " "               
## 6  ( 1 ) " "              " "      "*"               " "  " "               
## 6  ( 2 ) " "              " "      " "               " "  " "               
##          airlineAirAsia `airlineGO FIRST` airlineIndigo airlineSpiceJet
## 1  ( 1 ) " "            " "               " "           " "            
## 1  ( 2 ) " "            " "               " "           " "            
## 2  ( 1 ) " "            " "               " "           " "            
## 2  ( 2 ) " "            " "               " "           " "            
## 3  ( 1 ) " "            " "               " "           " "            
## 3  ( 2 ) " "            " "               " "           " "            
## 4  ( 1 ) " "            " "               " "           " "            
## 4  ( 2 ) " "            " "               " "           " "            
## 5  ( 1 ) " "            " "               " "           " "            
## 5  ( 2 ) " "            " "               " "           " "            
## 6  ( 1 ) " "            " "               " "           " "            
## 6  ( 2 ) " "            " "               " "           " "            
##          airlineStarAir airlineTrujet airlineVistara fromChennai fromDelhi
## 1  ( 1 ) " "            " "           " "            " "         " "      
## 1  ( 2 ) " "            " "           "*"            " "         " "      
## 2  ( 1 ) " "            " "           " "            " "         " "      
## 2  ( 2 ) " "            " "           "*"            " "         " "      
## 3  ( 1 ) " "            " "           "*"            " "         " "      
## 3  ( 2 ) " "            " "           " "            " "         " "      
## 4  ( 1 ) " "            " "           "*"            " "         " "      
## 4  ( 2 ) " "            " "           "*"            " "         " "      
## 5  ( 1 ) " "            " "           "*"            " "         " "      
## 5  ( 2 ) " "            " "           "*"            " "         " "      
## 6  ( 1 ) " "            " "           "*"            " "         " "      
## 6  ( 2 ) " "            " "           "*"            " "         " "      
##          fromHyderabad fromKolkata fromMumbai `dep_time_catearly morning`
## 1  ( 1 ) " "           " "         " "        " "                        
## 1  ( 2 ) " "           " "         " "        " "                        
## 2  ( 1 ) " "           " "         " "        " "                        
## 2  ( 2 ) " "           " "         " "        " "                        
## 3  ( 1 ) " "           " "         " "        " "                        
## 3  ( 2 ) " "           " "         " "        " "                        
## 4  ( 1 ) " "           " "         " "        " "                        
## 4  ( 2 ) " "           " "         " "        " "                        
## 5  ( 1 ) " "           " "         " "        " "                        
## 5  ( 2 ) " "           "*"         " "        " "                        
## 6  ( 1 ) " "           "*"         " "        " "                        
## 6  ( 2 ) " "           "*"         " "        " "                        
##          dep_time_catevening `dep_time_catmid night` dep_time_catmorning
## 1  ( 1 ) " "                 " "                     " "                
## 1  ( 2 ) " "                 " "                     " "                
## 2  ( 1 ) " "                 " "                     " "                
## 2  ( 2 ) " "                 " "                     " "                
## 3  ( 1 ) " "                 " "                     " "                
## 3  ( 2 ) " "                 " "                     " "                
## 4  ( 1 ) " "                 " "                     " "                
## 4  ( 2 ) " "                 " "                     " "                
## 5  ( 1 ) " "                 " "                     " "                
## 5  ( 2 ) " "                 " "                     " "                
## 6  ( 1 ) " "                 " "                     " "                
## 6  ( 2 ) " "                 " "                     " "                
##          dep_time_catnight toChennai toDelhi toHyderabad toKolkata toMumbai
## 1  ( 1 ) " "               " "       " "     " "         " "       " "     
## 1  ( 2 ) " "               " "       " "     " "         " "       " "     
## 2  ( 1 ) " "               " "       " "     " "         " "       " "     
## 2  ( 2 ) " "               " "       " "     " "         " "       " "     
## 3  ( 1 ) " "               " "       " "     " "         " "       " "     
## 3  ( 2 ) " "               " "       " "     " "         " "       " "     
## 4  ( 1 ) " "               " "       " "     " "         " "       " "     
## 4  ( 2 ) " "               " "       " "     " "         " "       " "     
## 5  ( 1 ) " "               " "       " "     " "         "*"       " "     
## 5  ( 2 ) " "               " "       " "     " "         " "       " "     
## 6  ( 1 ) " "               " "       " "     " "         "*"       " "     
## 6  ( 2 ) " "               " "       " "     " "         "*"       " "     
##          `arr_time_catearly morning` arr_time_catevening
## 1  ( 1 ) " "                         " "                
## 1  ( 2 ) " "                         " "                
## 2  ( 1 ) " "                         " "                
## 2  ( 2 ) " "                         " "                
## 3  ( 1 ) " "                         " "                
## 3  ( 2 ) " "                         " "                
## 4  ( 1 ) " "                         " "                
## 4  ( 2 ) " "                         " "                
## 5  ( 1 ) " "                         " "                
## 5  ( 2 ) " "                         " "                
## 6  ( 1 ) " "                         " "                
## 6  ( 2 ) " "                         " "                
##          `arr_time_catmid night` arr_time_catmorning arr_time_catnight
## 1  ( 1 ) " "                     " "                 " "              
## 1  ( 2 ) " "                     " "                 " "              
## 2  ( 1 ) " "                     " "                 " "              
## 2  ( 2 ) " "                     " "                 " "              
## 3  ( 1 ) " "                     " "                 " "              
## 3  ( 2 ) " "                     " "                 " "              
## 4  ( 1 ) " "                     " "                 " "              
## 4  ( 2 ) " "                     " "                 " "              
## 5  ( 1 ) " "                     " "                 " "              
## 5  ( 2 ) " "                     " "                 " "              
## 6  ( 1 ) " "                     " "                 " "              
## 6  ( 2 ) " "                     " "                 " "              
##          stop2_or_more stopnon_stop classeconomy monthMarch dayMonday
## 1  ( 1 ) " "           " "          "*"          " "        " "      
## 1  ( 2 ) " "           " "          " "          " "        " "      
## 2  ( 1 ) " "           "*"          "*"          " "        " "      
## 2  ( 2 ) " "           " "          "*"          " "        " "      
## 3  ( 1 ) " "           "*"          "*"          " "        " "      
## 3  ( 2 ) " "           "*"          "*"          " "        " "      
## 4  ( 1 ) " "           "*"          "*"          " "        " "      
## 4  ( 2 ) " "           "*"          "*"          "*"        " "      
## 5  ( 1 ) " "           "*"          "*"          " "        " "      
## 5  ( 2 ) " "           "*"          "*"          " "        " "      
## 6  ( 1 ) " "           "*"          "*"          " "        " "      
## 6  ( 2 ) " "           "*"          "*"          "*"        " "      
##          daySaturday daySunday dayThursday dayTuesday dayWednesday weekendNo
## 1  ( 1 ) " "         " "       " "         " "        " "          " "      
## 1  ( 2 ) " "         " "       " "         " "        " "          " "      
## 2  ( 1 ) " "         " "       " "         " "        " "          " "      
## 2  ( 2 ) " "         " "       " "         " "        " "          " "      
## 3  ( 1 ) " "         " "       " "         " "        " "          " "      
## 3  ( 2 ) " "         " "       " "         " "        " "          " "      
## 4  ( 1 ) " "         " "       " "         " "        " "          " "      
## 4  ( 2 ) " "         " "       " "         " "        " "          " "      
## 5  ( 1 ) " "         " "       " "         " "        " "          " "      
## 5  ( 2 ) " "         " "       " "         " "        " "          " "      
## 6  ( 1 ) " "         " "       " "         " "        " "          " "      
## 6  ( 2 ) " "         " "       " "         " "        " "          " "

Based on the result above, three plots showed same pattern therefore I decided to only consider Adjusted R^2 plot. From here, picked two models with the same Adj R^2 value, but with the most number of features and the least number of features and tested which one to choose.


Testing two models with same Adj R^2 value, but different number of features and deciding which is better

Based on Regsubsets feature selection (with Adjusted R^2 > 0.93) the most important features for model_1(the most number of features) are:

  1. Daily_covid_cases

  2. fromChennai

  3. fromMumbai

  4. classeconomy

  5. monthMarch

  6. toMumbai

lm_fs_1 <- lm(price ~ Daily_covid_cases+fromChennai+fromMumbai+classeconomy+monthMarch + toMumbai, data = train)
summary(lm_fs_1)
## 
## Call:
## lm(formula = price ~ Daily_covid_cases + fromChennai + fromMumbai + 
##     classeconomy + monthMarch + toMumbai, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -413.2  -25.5   -1.1   27.0  398.2 
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        664.247      0.635  1045.82   <2e-16 ***
## Daily_covid_cases   10.291      0.341    30.14   <2e-16 ***
## fromChennai          4.383      0.564     7.77    8e-15 ***
## fromMumbai           6.471      0.485    13.34   <2e-16 ***
## classeconomy      -584.599      0.399 -1465.16   <2e-16 ***
## monthMarch         -12.768      0.741   -17.22   <2e-16 ***
## toMumbai             6.985      0.482    14.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.5 on 198016 degrees of freedom
## Multiple R-squared:  0.917,  Adjusted R-squared:  0.917 
## F-statistic: 3.63e+05 on 6 and 198016 DF,  p-value: <2e-16

Important features with similar Adj R^2 for model_2(the least number of features) are:

  1. fromchennai

  2. classecomomy

  3. monthMarch

lm_fs_2 <- lm(price ~ fromChennai+classeconomy+monthMarch, data = train)
summary(lm_fs_2)
## 
## Call:
## lm(formula = price ~ fromChennai + classeconomy + monthMarch, 
##     data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -399.4  -25.3   -1.6   27.5  389.8 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   680.404      0.430  1581.25  < 2e-16 ***
## fromChennai     3.133      0.555     5.64  1.7e-08 ***
## classeconomy -584.898      0.400 -1461.79  < 2e-16 ***
## monthMarch    -31.629      0.405   -78.17  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.7 on 198019 degrees of freedom
## Multiple R-squared:  0.916,  Adjusted R-squared:  0.916 
## F-statistic: 7.21e+05 on 3 and 198019 DF,  p-value: <2e-16


Testing two models using ANOVA

ano <- anova(lm_fs_1,lm_fs_2)
ano
## Analysis of Variance Table
## 
## Model 1: price ~ Daily_covid_cases + fromChennai + fromMumbai + classeconomy + 
##     monthMarch + toMumbai
## Model 2: price ~ fromChennai + classeconomy + monthMarch
##   Res.Df      RSS Df Sum of Sq   F Pr(>F)    
## 1 198016 1.35e+09                            
## 2 198019 1.36e+09 -3  -8365809 410 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since both the models have similar Adjusted R^2 values (0.917 vs 0.916) and the model_2 is built using fewer variables, I will choose the model_2.

y_pred = predict(lm_fs_2, test)
rmse(test$price,y_pred)
## [1] 82.5
mean(abs((test$price-y_pred)/test$price)) * 100
## [1] 32.9


Test RMSE,(Root Mean Squared Error) : 82.459

Test MAPE,(Mean Absolute Percentage Error) : 32.851% (model predicts with average error percentage of 32.851%)

With MAPE of 33%, which improved from 38%, I can conclude that this linear model after the feature selection gives us better model than the baseline model.


Decision Tree

For the next model, I performed a Decision tree.

treefitRpart <- rpart(price ~ ., data=train, control = list(maxdepth = 4, cp=0.005)) 
summary(treefitRpart)
## Call:
## rpart(formula = price ~ ., data = train, control = list(maxdepth = 4, 
##     cp = 0.005))
##   n= 198023 
## 
##        CP nsplit rel error xerror     xstd
## 1 0.91354      0    1.0000 1.0000 0.002566
## 2 0.02178      1    0.0865 0.0865 0.000511
## 3 0.00735      2    0.0647 0.0647 0.000370
## 4 0.00500      3    0.0573 0.0573 0.000319
## 
## Variable importance
##     classeconomy time_taken_hours     stopnon_stop airlineAir India 
##               94                2                2                1 
##   airlineVistara 
##                1 
## 
## Node number 1: 198023 observations,    complexity param=0.914
##   mean=260, MSE=8.16e+04 
##   left son=2 (135182 obs) right son=3 (62841 obs)
##   Primary splits:
##       classeconomy     < 0.5    to the right, improve=0.9140, (0 missing)
##       airlineVistara   < 0.5    to the left,  improve=0.1340, (0 missing)
##       airlineIndigo    < 0.5    to the right, improve=0.0833, (0 missing)
##       time_taken_hours < -0.333 to the left,  improve=0.0591, (0 missing)
##       airlineGO FIRST  < 0.5    to the right, improve=0.0391, (0 missing)
## 
## Node number 2: 135182 observations
##   mean=73.3, MSE=1.06e+03 
## 
## Node number 3: 62841 observations,    complexity param=0.0218
##   mean=660, MSE=2e+04 
##   left son=6 (4323 obs) right son=7 (58518 obs)
##   Primary splits:
##       time_taken_hours < -1.07  to the left,  improve=0.2810, (0 missing)
##       stopnon_stop     < 0.5    to the right, improve=0.2760, (0 missing)
##       airlineVistara   < 0.5    to the left,  improve=0.0915, (0 missing)
##       airlineAir India < 0.5    to the right, improve=0.0915, (0 missing)
##       toDelhi          < 0.5    to the right, improve=0.0331, (0 missing)
##   Surrogate splits:
##       stopnon_stop < 0.5    to the right, agree=0.998, adj=0.971, (0 split)
## 
## Node number 6: 4323 observations
##   mean=385, MSE=5.4e+03 
## 
## Node number 7: 58518 observations,    complexity param=0.00735
##   mean=680, MSE=1.5e+04 
##   left son=14 (20646 obs) right son=15 (37872 obs)
##   Primary splits:
##       airlineVistara   < 0.5    to the left,  improve=0.1350, (0 missing)
##       airlineAir India < 0.5    to the right, improve=0.1350, (0 missing)
##       distance         < 0.257  to the left,  improve=0.0244, (0 missing)
##       time_taken_hours < 0.582  to the right, improve=0.0210, (0 missing)
##       toKolkata        < 0.5    to the left,  improve=0.0208, (0 missing)
##   Surrogate splits:
##       airlineAir India      < 0.5    to the right, agree=1.000, adj=1.000, (0 split)
##       time_taken_hours      < 2.17   to the right, agree=0.654, adj=0.018, (0 split)
##       hour                  < -1.4   to the left,  agree=0.652, adj=0.015, (0 split)
##       dep_time_catmid night < 0.5    to the right, agree=0.650, adj=0.008, (0 split)
##       Daily_covid_cases     < 1.95   to the right, agree=0.649, adj=0.006, (0 split)
## 
## Node number 14: 20646 observations
##   mean=619, MSE=9.44e+03 
## 
## Node number 15: 37872 observations
##   mean=714, MSE=1.49e+04
fancyRpartPlot(treefitRpart, caption = "Decision Tree")

y_pred = predict(treefitRpart, test)

Test RMSE,(Root Mean Squared Error) : 68.331

Test MAPE,(Mean Absolute Percentage Error) : 35.048% (model predicts with average error percentage of 35.048%)


XGBoost

# Creating a data matrix, as xgboost does not understand dataframe
# data should be fed as a matrix

label_train <- y_train
label_test <- y_test

# put our testing & training data into two seperates Dmatrixs objects
dtrain <- xgb.DMatrix(data = as.matrix(x_train), label= label_train)
dtest <- xgb.DMatrix(data = as.matrix(x_test))

Created a hyper parameter matrix

default_param<-list(
        objective = "reg:linear",
        booster = "gbtree",
        eta=0.05, #default = 0.3
        gamma=0,
        max_depth=3, #default=6
        min_child_weight=4, #default=1
        subsample=1,
        colsample_bytree=1
)

Ran the model with 500 rounds

xgb_mod <- xgb.train(data = dtrain, params=default_param, nrounds = 500)
## [15:58:19] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
# y_train_XGBpred <- predict(xgb_mod, dtrain)
# y_train_predictions <- exp(y_train_XGBpred) #need to reverse the log to the real values
# rmse(y_train_actuals,y_train_predictions)
# mean(abs((y_train_actuals-y_train_predictions)/y_train_actuals)) * 100
y_test_XGBpred <- predict(xgb_mod, dtest)
y_test_predictions <- exp(y_test_XGBpred) #need to reverse the log to the real values

Test RMSE,(Root Mean Squared Error) : 54.302

Test MAPE,(Mean Absolute Percentage Error) : 15.577% (model predicts with average error of 15.577%)


Observations

  • Model performance has increased dramatically, achieving lowest test errors

  • XGBoost reduced the MAPE by 58.6% from (37.7 to 15.577)

library(Ckmeans.1d.dp) #required for ggplot clustering
mat <- xgb.importance (feature_names = colnames(dtrain),model = xgb_mod)
xgb.ggplot.importance(importance_matrix = mat[1:20], rel_to_first = TRUE)


Observations

The most important features for predicting the price of flight are Class, Daily Covid Cases, Duration of the flight, Airline and Distance


Best model

Model Performances

Based on the above table, I can observe XGBoost to have performed the best on the test data with RMSE of 54.302 and MAPE of 15.57%. Thus, I choose XGBoost as our best model.


Conclusion

In the 21st century the airlines deploy machine learning models to determine the price of a ticket, and the prices change multiple times over a day based on the demand and several other variables taken into consideration. I after analyzing few of these variables, can conclude the following:

  1. Class of the seat, impacts the price of the ticket (Business class seats are around 8 times more expensive than economy seats).

  2. Travel restrictions or any pandemic like situations impacts the number of flights scheduled in a day, which has its effect on the price of the ticket.

  3. Flights with more number of stops between departure and arrival city increases the price.

  4. The day and time of departure also has an affect on the price, with Tuesday offering the lower prices and odd hours such as late and midnight offer the least price over a week.

  5. The best combination of variables for predicting the flight price are Class, Daily Covid Cases, Duration of the flight, Airline and Distance.

  6. The best model for flight price prediction are in the order of XGBoost (15.57%), Decision Tree (35.347%) and Linear regression (32.907%)


I received many feedback from the midterm project to expand my dataset to International flights. Unfortunately, I did not have enough time to clean and process the new dataset with the similar features to the current dataset. So I had to keep working with the current dataset. What wI believe I can improve from here is by using an International flights dataset, I can bring more generalized result, because the current dataset is limited to India domestic flights. Next time, if I have a chance, I would like to perform the same analysis using more general dataset and hopefully that can give some more general ideas to people, and further will help when booking a flight ticket.


References

Flight Price Prediction. (n.d). Kaggle. Retrieved from https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction?select=economy.csv

India: Coronavirus Pandemic Country Profile. (2022,Nov 2). Our world in data . Retrieved from https://ourworldindata.org/coronavirus/country/india

India: Authorities to ease entry restrictions for international arrivals from Feb. 14. (2022,Feb 10). Crisis24. Retrieved from https://crisis24.garda.com/alerts/2022/02/india-authorities-to-ease-entry-restrictions-for-international-arrivals-from-feb-14-update-57

Why do layover flights cost more than direct flights? (n.d). Tripadvisor. Retrived from https://www.tripadvisor.com/ShowTopic-g1-i10702-k8711304-Why_do_layover_flights_cost_more_than_direct_flights-Air_Travel.html

Why are flights with stops more expensive? (n.d). Quora. Retrive from https://www.quora.com/Why-are-flights-with-stops-more-expensive